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Abstract We consider the question of Markov chain Monte 
Carlo sampling from a general stick-breaking Dirichlet pro- 
cess mixture model, with concentration parameter a. This 
paper introduces a Gibbs sampUng algorithm that combines 
the slice sampling approach of Walker' ('200711 and the retro- 
spective sampling approach of Papaspiliopoulos and Roberts 
(1200811. Our general algorithm is implemented as efficient 



open source C-n- software, available as an R package, and is 
based on a blocking strategy similar to that suggested by ;Pa^ 
|paspiliopoulos| ( [2008) l and implemented by |Yau et al| ( |201 1| ). 

We discuss the difficulties of achieving good mixing in 
MCMC samplers of this nature and investigate sensitivity to 
initialisation. We additionally consider the challenges when 
an additional layer of hierarchy is added such that joint infer- 
ence is to be made on a. We introduce a new label switching 
move and compute the marginal model posterior to help to 
surmount these difficulties. Our work is illustrated using a 



profile regression (Molitor et al 20101 application, where 
we demonstrate good mixing behaviour for both synthetic 
and real examples. 
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1 Introduction 

Fitting mixture distributions to model some observed data is 
a common inferential strategy within statistical modelling, 
used in applications ranging from density estimation to re- 
gression analysis. Often, the aim is not only to fit the mix- 
ture, but additionally to use the fit to guide future predic- 
tions. Approaching the task of mixture fitting from a para- 
metric perspective, the task to accomplish is to cluster the 
observed data and (perhaps simultaneously) determine the 
cluster parameters for each mixture component. This task is 
significantly complicated by the need to determine the num- 
ber of mixture components that should be fitted. An increas- 
ingly popular alternative approach to parametric modelling 
is to adopt a Bayesian non-parametric approach, fitting an 
infinite mixture, thereby avoiding determination of the num- 
ber of clusters. The Dirichlet process ( |Ferguson||1973| l is a 
well studied stochastic process that is widely used in Bayesian 
non-parametric modelling, with particular applicability for 
mixture modelling. Draws from a Dirichlet process are them- 
selves probability measures, with the property that all of 
their marginal distributions are Dirichlet distributions. 

Our general algorithm, a DPMM (Dirichlet process mix- 
ture model) sampling algorthm, is implemented as efficient 
open source C-n- code also available as an R package (jLiv 



erani et al 2013 1. Our sampler is highly extensible, but cur- 
rently permits sampling of Gaussian or categorical discrete 
mixtures, as well as more general problems, previously re- 
ferred to collectively as "profile regression" ( jMolitor et al| 
2010[ Papathomas et al 2011| l. The implementation also han- 
dles DPMM sampling with simultaneous variable selection, 
as discussed in Papathomas et al^(2012) . 

Whilst previous methods have explored ways of sam- 
pling from the full DPMM, it appears that little discussion 
has been presented detailing the implicit difficulties of using 
a Gibbs (or Metropolis-within-Gibbs) sampling approach to 
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update such a complex model space. In particular, for real 
(rather than synthetic) data applications of the DPMM, the 
state space can be highly multimodal, with well separated 
regions of high posterior probability co-existing, often cor- 
responding to clusterings with different number of compo- 
nents. We demonstrate that such highly multimodal spaces 
present difficulties for the existing sampling methods to es- 
cape the local modes, with poor mixing resulting in infer- 
ence that is influenced by sampler initialisation. In the most 
serious case, this can be interpreted as non-convergence of 
the MCMC sampler. 

A secondary mixing issue relates to the mixing across 
the ordering of clusters in a particular clustering process. 
As we shall detail, such issues are particularly important 
when simultaneous inference is desired for the concentra- 
tion parameter a. This mixing issue was highlighted by Pa-i 



Ipaspiliopoulos and Roberts^ (2008) who observed that the 
inclusion of label-switching moves can help to resolve the 
problem. We demonstrate that the moves that they propose 
offer only a partial solution to the problem, and we suggest 
an additional label switching move that appears to enhance 
the performance of the sampler. 

In the following section, we present the details of the 
Dirichlet process mixture model, introducing our sampler 
for this model and the intricacies involved therein. This is 
followed by Section [3] where we present specific implemen- 
tation details for various different mixture and profile regres- 
sion models. Section fflbriefly discusses the post-processing 
methods that we use for the rich Bayesian output. SectionIS] 
explores the issues of mixing for DPMM samplers, present- 
ing a sensitivity analysis to sampler initialisation, followed 
by details of the computation of the marginal model poste- 
rior and label-switching functionality built in our sampler 
for resolving these issues. In Section|6]we present a demon- 
stration of the applicability of our sampler to synthetic and 
real data examples, before conclusions are presented in Sec- 
tion H 



2 Dirichlet process mixture models 

A variety of ways have been used to show the existence of 
the Dirichlet Process, using a number of different formu- 



lations (Ferguson 1973 Blackwell and MacQueen 1973 1. 
In this paper we focus on Dirichlet process mixture mod- 
els (DPMM), based upon the following simplified construc- 



( fT994l ).If 

CXD 



Oc - P0„ forceZ+, 

V-c = V, Ha ~ V,) for c e Z+ \ {!}, (1) 

Kc 

-01 ^ Vi, and 

Vc -Beta(l,a) forceZ+, 

where 6x denotes the Dirac delta function concentrated at 
X, then P ^ DP (a, Pep). This formulation for V and i/j 
is known as a stick-breaking distribution. Importantly, the 
distribution P is discrete, because draws 0i,02, ■ ■ ■ from 
P can only take the values in the set {0c : c E Z+}. 

It is possible to extend the above formulation to more 



general stick-breaking formulations (Ishwaran and James 



[200T] [KaliTetall |20TT] [Pitman and Yor| 1 1 997| l 



2.1 Sampling from the Dirichlet process mixture model 

More recently, two alternative innovative approaches to sam- 
pling the full DPMM have been proposed. The first, intro- 
duced by Walker ( 2007[ l and generalised by Kalli et al ( 20 1 1 1. 
uses a novel slice sampling approach, resulting in full condi- 
tionals that may be explored by the use of a Gibbs sampler. 
The second distinct MCMC sampling approach was pro- 
posed in parallel by [Papaspiliopoulos and Roberts] ( |2008| ). 
The proposed sampler again uses a Gibbs sampling approach, 
but is based upon an idea termed retrospective sampling, 
allowing a dynamic approach to the determination of the 
number of components (and their parameters) that adapts as 
the sampler progresses. The cost of this approach is an in- 
geneous but complex Metropolis-within-Gibbs step, to de- 
termine cluster membership. 

Despite the apparent differences between the two strate- 
gies, |HpaspiIi^ouios| ([2008]) noted that the two algorithms 
can be effectively combined to yield an algorithm that im- 
proves either of the originals. The resulting sampler was im- 

( 201111, and a similar 



plemented and presented by 



Yau et al 



version was presented by Dunson (2009)1. The current work 



presented in this paper is our interpretation of these ideas, 
implemented as efficient C-n- code within the R package 



tive definition of the Dirichlet process, due to Sethuraman 



PReMiuM ( [Liverani etal] |2013[ ) for general DPMM sam- 
pling. 

For the Dirichlet process mixture model (DPMM), the 
(possibly multivariate) observed data D = {Di,D2, ■ ■ ■ , Dn) 
follow an infinite mixture distribution, where component c 
of the mixture is a parametric density of the form /c(-) = 
f{-\0c,A) parametrised by some component specific pa- 
rameter 0c and some global parameter A. Defining (latent) 
parameters 6?i , 02 , ■ • ■ , 0n as draws from a probability dis- 
tribution P following a Dirichlet process DP{a, Poq) and 
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again denoting the dirac delta function by S, this system can 
be written, 

D,\O,,A^fiD,\0,,A) forz = l,2,...,n, (2) 



OO 

01 ^ ^ tpcSe, for i = l,2,...,n. 



When making inference using mixture models (either fi- 
nite or infinite) it is common practice to introduce a vector 
of latent allocation variables Z. Such variables enable us to 
explicity characterise the clustering and additionally facili- 
tate the design of MCMC samplers. Adopting this approach 
and writing -0 = {%l)i-,'i>2-, ■ ■ •) and & = {Oi,02, ■ ■ ■), we 
re- write Equation l2] as 

D,\Z,0,A^ f{D,\0z, ,A) for z = 1, 2, . . . , n, 
Oc -- Poo force Z+, 
F{Zi =c\ip) =%l)c forceZ+, i^l,2,...,n. (3) 

See [Liverani et al| ( |2013 ) for further details on our im- 
plementation of the Dirichlet process mixture model. 



4 Post processing 

Given a sample of partitions from the posterior distribution 
of a Bayesian cluster model, where the sample is the out- 
put of an MCMC algorithm, it is often desirable to summa- 
rize the sample with a single representative clustering esti- 
mate. These include the maximum a posteriori (MAP) esti- 
mate and methods based on the posterior similarity matrix, a 
matrix containing the posterior probabilities that the obser- 
vations i and j are in the same cluster. We find that methods 
based on the posterior similarity matrix are less suceptible 
to Monte Carlo error, especially when the optimal partition 
is obtained using clustering methods, such as partitioning 
around medoids, that take advantage of the whole MCMC 



output ([Mohtor et al 20101. 



Moreover, we also allow predicted values to be com- 
puted based on probabilistic allocations, or using a Rao- 
Blackwellised estimate of predictions, where the probabil- 
ities of allocations are used instead of performing draws. 



3 Example Models 

The general sampler of the previous section is applicable for 
many specific models, depending on the choices of / and 
Pqo . Our implementation allows for mixtures of Gaussian 
covariates, discrete covariates or a combination of the two, 
assuming independence between continuous and categorical 
data conditional on the cluster allocations ( [Liverani et alj 
|20T3] l. 

In particular, we are interested in using DPMM as an al- 
ternative to regression models, non-parametrically linking a 
response vector Y to covariate data X through cluster mem- 
bership. This idea has been used by several authors includ- 
ing pmionetal 



tor et al 



( |20T0l i, 



( 2008[ l, Bigelow and Dunson ( 



Papathomas et al ( 201 l[ l, and 



2009) 1, 



Moli- 



Mohtor et al 



( 2011| l. Our presentation is most similar to the latter three 
of these papers and we refer to it as profile regression. For- 
mally, the data D = (V , X, W) is now extended to contain 
response data Yi and covariate data Xi for each individual i, 
where the contribution of the covariate data to the response 
may be cluster dependent. There is also the possibility to in- 
clude additional fixed effects Wi for each individual, which 
are constrained to only have a global (i.e. non-cluster spe- 
cific) effect on the response Yi. 

The data Di is then jointly modelled as the product of a 
response model and a covariate model, to give the following 
likelihood: 

p{D,\Z„e, A) = fY{Y^0z,,A, Wi)fx{X,\0z^,A). 

The likelihood /y depends upon the choice of response 
model. Our R package allows Bernoulli, Binomial, Poisson, 
Normal and categorical response, as well as Normal and dis- 
crete covariates. 



5 Mixing of the MCMC algorithm 

While the design and implementation of an MCMC sampler 
to sample the infinite DPMM has been the focus of previous 
research, little appears to be written about monitoring the 
performance of such a sampler in realistic applications, in 
particular with respect to investigating its convergence and 



mixing. Papaspiliopoulos and Roberts ( 2008 1 briefly discuss 
the importance of the inclusion of label switching to im- 
prove mixing across cluster orderings, and while we agree 



with the need for such moves (see Section 5.1 1, in our expe 



rience more fundamental mixing issues can affect the sam- 
pler for real-data applications. In this section we address 
these issues. 



5.1 Ordering and the concentration parameter 

One area that requires attention is the mixing of the algo- 
rithm over cluster orderings. In particular, whilst the likeli- 
hood of the DPMM is invariant to the order of cluster labels, 
the prior specification of the stick breaking construction is 
not. As detailed by Papaspiliopoulos and Roberts] ( |2008[ ), 
the definition of the ipc in terms of Vc, imposes the rela- 
tion E[-0c] > ]E[^c+i] for all c. This weak identifiability, 
discussed in more detail by |Porteous et al ( |2006| l, also man- 
ifests itself through the result P(V'c > V'c+i) > 0.5 for all 
c, a result that we prove in Appendix |A.l| 

The importance of whether the algorithm mixes suffi- 
ciently across orderings depends partially upon the object 
of inference. Specifically, since P('0c > i'c+i) depends 
upon the prior distribution of a, if inference is to be simul- 
taneously made about a (as is the scenario considered in 
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this paper), it is very important that the algorithm exhibits 
good mixing with respect to a. If this was not the case, 
the posterior marginal distribution for a would not be ad- 
equately sampled, and since a is directly related to the num- 
ber of non-empty clusters (see ' Antoniak|1 974' for details), 
poor mixing across ordering may further inhibit accurate in- 
ference being made about the number of non-empty clus- 
ters. This situation would be further exaggerated for more 
general stick breaking constructions (of the sort mentioned 
in the introduction). While our sampler includes the possi- 
bility of setting a fixed value of a, more generally we wish 
to allow a to be estimated. 



5.1.1 Label switching 

To ensure adequate mixing across orderings, it is impor- 



Acceptance Rate for move 1, with moves 1-2 implemented 



tant to include label-switching moves, as observed by Pa- 



paspiliopoulos and Roberts ( 2008 1. Without such moves, the 
one-at-a-time updates of the allocations Zi, mean that clus- 
ters rarely switch labels, and consequentially the ordering 
will be largely determined by the (perhaps random) initial- 
isation of the sampler For all choices of a, the posterior 
modal ordering will be the one where the cluster with the 
largest number of individuals has label 1, that with the sec- 
ond largest has label 2 and so on. However, a affects the 
relative weight of other (non-modal) orderings, and a prop- 
erly mixing sampler must explore these orderings according 
to their weights. 

We adopt the label-switching moves suggested by 'Pa-' 
[paspiliopoulos and Roberts|p008j ), and details can be found 
therein. However, in our experience, while these moves may 
experience high acceptance rates early on in the life of the 
sampler, once a "good" (in terms of high posterior support) 
ordering is achieved, the acceptance rates drop abruptly. This 
means that there is little further mixing in the ordering space. 



For the first of the moves that Papaspiliopoulos and Roberts 



( 2008| l propose, where the labels of two randomly selected 
clusters are exchanged, we observed acceptance rates below 
10% for any sample of 500 sweeps. For the second of the 
moves, where the labels of two neighbouring clusters are 
swapped, along with the corresponding Vc, Vc+i, Figure [T] 
demonstrates the decrease in acceptance rate. This decrease 
can be explained by the observation (made by the original 
authors) that the second move type is always accepted if one 
of the clusters is empty, which can happen often in initial 
cluster orderings with low posterior support. However, our 
concern is that while these label-switching moves appear to 
encourage a move towards the modal ordering, once that or- 
dering is attained, the sampler rarely seems to escape too far 
from this ordering. 

Our solution is to introduce a third label switching move 
that we describe here. In brief, the idea is to simultaneously 
propose an update of the new cluster weights so they are 



Acceptance Rate for move 2, witPi moves 1-2 implemented 



Samples from posterior distribution of alpha 




Fig. 1: Acceptance rate for intervals of 500 sweeps for the 
two label switching moves proposed by |Papaspiliopoulos| 
|and Roberts] ( |2008| ) and comparison with samples from the 
posterior distribution of a (bottom). Note that convergence 
appears to be achieved after 5,000 iterations for the example 
shown. If only the first of the two moves is implemented, 
alpha moves extremely slowly (more than 50,000 iterations 
are not enough, not shown) while if only the second of the 
two moves is implemented, for this example, 17,000 itera- 
tions are necessary for alpha to converge (not shown). 



something like their expected value conditional upon the 
new allocations. Specifically, defining Z* ~ maxi<i<„ Zi 
and A = {c G Z+ : c < Z*} the move proceeds as follows: 
first choose a cluster c randomly from A \ {Z*}. Propose 
new allocations 



z: 




i : Zi — c 
i : Zi — c + 1 
otherwise. 



(4) 
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and switch parameters associated to these clusters such that 



0C+1 I = C 

01 otherwise. 



(5) 



Additionally, propose new weights ip'^ and V'c+i for com- 
ponents c and c + 1 such that 



, V,+ E[-4,^\Z',a] 



E[4>^\Z,a] 



i>i 



l = c 

I ^ c+1 and 

otherwise. 



(6) 



where tp^ — 4'c + 4'c+i and 



'^' = A+i 



by setting 



E[4,,\Z',a] 
E[i;,+i\Z,a] 



■ V'c 






i>'a 



n,<e(i-vi) 



(i-^c')n,<e(i-vi) 

Vi 



I = c 
l = c+l 
otherwise. 



(7) 



All other variables are left unchanged. Assuming that there 
are ric and n^+i individuals in clusters c and c + 1 respec- 
tively at the beginning of the update, the acceptance proba- 
bility for this move is then given by min{l, R] where 

R^\ , „ " , ^1 i?"'^+'i?2% where (8) 



Ri 

i?2 



IpcRl +1pc+lR2, 
l+a + Tlc+l + J2l>c+1 "-' 

a + Tic+i + E;>c+i "i 



and 



(9) 
(10) 



More details can be found in Appendix A.2 



5.2 Sensitivity to initiahsation and monitoring convergence 

In practice, convergence is a key task of users of complex 
MCMC algorithms. For our sampler, with all the moves de- 
scribed in this paper fully implemented, simulated data with 
a strong underlying signal will convergence in a few itera- 
tions to the generating partition, even with substantially dif- 
ferent initialisations. However, for real data, where the sig- 
nal might not be very strong, different initialisations can re- 
sult in substantially different partitions, indicating that con- 
vergence in the highly multi-modal model space remains a 
significant challenge. 

Accepting that the challenge persists, it is clearly impor- 
tant that the user has diagnostic methods to assess whether 



convergence can be reasonably expected.. Due to the na- 
ture of the model space, many traditional techniques can- 
not be used in this context. For our hierarchical model, as 
described in Equations [T] and [3] there are no parameters that 
can be used to meaningfully demonstrate convergence of the 
algorithm. Specifically, the parameters of the fixed effects 
tend to converge really quickly, regardless of the underly- 
ing clustering, as they are not cluster specific and therefore 
are not a good indication of the overall convergence. On the 
other hand the cluster parameters, such as the 9c s, cannot 
be tracked, as their number and interpretation changes from 
one iteration to the next (along with the additional compli- 
cation that the labels of clusters may switch between itera- 
tions). While the concentration parameter a may appear to 
offer some information, using this approach can be deceiv- 
ing, since a sampler that becomes stuck in a local mode in 
the clustering space will demonstrate a sample of a that ap- 
pears to have converged. 

Based upon our experience with real datasets, we sug- 
gest that to better assess convergence, it is important to com- 
pare multiple runs of the sampler started from significantly 
different initialisations, and to monitor the marginal model 
posterior in each run, a calculation that we detail in the fol- 
lowing section. 

Marginal Model Posterior We define the marginal model 
posterior as p(Z|D). This quantity represents the posterior 
distribution of the allocations given the data, having marginalised 
out all the other parameters. As such, it can be thought of a 
model posterior, as Z fully specifies the partition, which can 
be thought of as "the model" in our context. The marginal 
model posterior can be computed in closed form for discrete 
covariates but it requires some approximation when includ- 
ing the response model or Normal covariates. In practice, 
to reduce the complexity of the approximations required we 
have also conditioned on a. The choice of a to condition on 
is based on experiments on the dataset under study with a 
variable. 

Computing the marginal model posterior for each run of 
the MCMC and comparing between runs has proven to be 
a very effective tool for our real examples, particularly to 
identify runs that were significantly different from others, 
perhaps due to convergence issues. 

Figurel2]demonstrates that the strong signal in the simu- 
lation study in |Liverani et al (2013 1 means that the sampler 
converges regardless of the initial number of clusters. 

Initial Number of Clusters One consequence of our investi- 
gations using the marginal model posterior, was that despite 
the MCMC moves being very powerful, permitting signifi- 
cant moves across the model space, there is still considerable 
difficulties for the algorithm to split clusters and thereby es- 
cape local modes. This is due to the intrinsic characteristics 
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Fig. 2: Log marginal model posterior for the simulated 
dataset with different initial number of clusters. 



of partition spaces and the extremely high number of possi- 
ble ways to split a cluster, even if it only has a small number 
(for example, 50 or more) subjects in it. 

Despite our best efforts to overcome these challenges, 
any Gibbs sampling based approach that updates (blocks of) 
parameters sequentially is likely to struggle to escape the 
local modes, due to the separation of these modes in model 
space. Furthermore, introducing more ambitious Metropolis- 
Hastings moves that attempt to update a larger number of 
parameters simultaneously is also a very difficult task due to 
the difficulty in designing moves to areas of the model space 
with similar posterior support. 

Rather than subtly ignoring the problem, we suggest that 
used with caution our sampler still provides a useful infer- 
ential tool, but that the limitation must be realised and ac- 
knowledged. For example, because of the difficulty that the 
sampler has in increasing the number of clusters for prob- 
lems with real data with weak signal, it is important to ini- 
tialise the algorithm with a number of clusters which is greater 
than the anticipated number of clusters that the algorithm 
will convergence to. This necessarily involves an element of 
trial and error to determine what that number is, where mul- 
tiple runs from different initialisations must be compared, 
for example using the marginal model posterior This is demon- 
strated in Section |6T| 



an "optimal" partition, which might be determined, for ex- 
ample, by using a similarity matrix based upon the output 
of the MCMC run. Accepting that this way of summarising 
the output may be important in many cases, our package al- 
lows such an approach, but in general we advise against its 
use where possible, as it is the equivalent of giving a point 
estimate without a posterior distribution or a credible inter- 
val, and therefore discards a significant amount of inferential 
information from the run. Moreover, due to the complexity 
and sheer size of the model space, the optimal partitions tend 
to differ between runs of the MCMC, and it is not an easy 
task to assess whether convergence has been achieved based 
on this approach alone. 

In our experience, a better approach is to use posterior 
predictions, where posterior predictive distributions for quan- 
tities of interest can be derived from the whole MCMC run, 
taking the uncertainty over clustering into account. Depend- 
ing on the quantity of interest, the posterior predictive dis- 
tribution can often be relatively robust even across runs with 
noticeably different optimal partitions. While this may not 
help us to determine if the algorithm has sufficiently ex- 
plored the model-space, if the purpose of the inference is 
to make predictions, this robustness can be reassuring. 



6 Investigation of the algorithm's properties in a large 
data appUcation 

In this section, we report the results of using our sampler 
on a real epidemiological dataset with 2,639 subjects. The 
figures and results presented here have been produced with 
our R package PReMiuM ( [Liverani erall|2013| l. We do not 
report the results from a simulated study here as this is well 
documented in [Liverani et al| ( |2013| l where it is shown that 
our algorithm performs well. In that case, while ensuring 
convergence is a complex problem, we have observed good 
stability in all our runs, with results from independent chains 
virtually identical. As expected, the analysis of real data is 
more challenging: it requires care in ensuring convergence, 
as the signal is not as strong as in a simulation study. How- 
ever, these are challenges that might be encountered more 
widely by users wishing to apply the methods to real data, 
and by presenting an example and it allows us to highlight 
and discuss the issues that arise. 



6.1 The data 



Optimal Partitions vs Predictions Another important ques- 
tion when making inference using the Dirichlet process mix- 
ture model, is how best to interpret the output. Specifically, 
difficulties are faced in understanding or conveying the un- 



Our dataset is a subset taken from an epidemiological case- 
control study, the analysis of which has provided the moti- 
vation of most of the work presented in this paper. In the 
illustrative example we have 2,639 subjects, and use 6 dis- 



certainty of the partitioning. Many approaches make use of crete covariates each with 5 categories. We also include 13 
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Fig. 3: Log marginal model posterior for the real epidemio- 
logical dataset with different initial number of clusters. 



fixed effects. The low signal contained in the data poses is- 
sues with convergence of the MCMC, as we illustrate below. 
Our results are based upon running the multiple chains 
each for 100,000 iterations after a burn-in sample of 50,000 
iterations. In some cases, behaviour within this bum-in pe- 
riod is illustrated. 

Marginal Model Posterior and Number of Clusters As dis- 
cussed in Section l5] we run multiple MCMC runs, starting 
each with very different numbers of initial clusters. For this 
dataset, initialising the sampler with fewer than 20 clusters 
results in marginal model posterior distributions that are sig- 
nificantly different between runs. This is illustrated in Figure 
[3] where initialisations with small number of clusters result 
in much lower marginal model posterior values than can be 
achieved with a higher initial number of clusters. It is ap- 
parent that there is a clear cut-off at 20 clusters, where in- 
creasing the number of initial clusters further does not result 
in an increase in the marginal model posterior, suggesting 
that with 20 clusters or more the sampler is able to visit ar- 
eas of the model space with the highest posterior support. 
This is further supported by additional comparisons of the 
predictions and parameters of the model runs (not reported), 
which provide additional of good mixing. For comparison. 



see the equivalent figure for the simulation study in Liverani 
et al (2013 1 shown in Figure l2j 



Posterior Distribution of a Figure HI shows the trace plots 
of a for a number of runs, each with a different initial num- 
ber of clusters. As can be seen by considering the trace for 
the run with 5 initial clusters, if we were monitor to this trace 



Fig. 4: Posterior distribution of a for different number of 
initial clusters: trace of the first 5,000 iterations for three 
different initialisation. 



in isolation, a appears to have converged. However, by com- 
paring to the trace plots of initialisations with greater than 20 
clusters, we see that it has "converged" to a different value 
than the value corresponding to models with higher posterior 
support (Figure B}. Monitoring the trace of a is not a good 
indication of convergence and inference on a relies on being 
able to properly explore the model space. This is particularly 
relevant for more complicated stick breaking constructions 
(as mentioned in the SectionfTl) where a may be replaced by 
a number of parameters. 

As a final point on this matter, we observe that not unsur- 
prisingly, the trace plots in Fig.l4]show that while it is advis- 
able to start with a large number of initial clusters, starting 
with many more clusters than necessary can result in a larger 
number of iterations required for convergence. 



Posterior distribution of the number of clusters The need to 
initialise the sampler with a sufficiently high number of clus- 
ters is supported by the behaviour in the burn-in period illus- 
trated in Figure l6j for a run with 31 initial clusters. This plot 
shows the trace of the number of clusters for the first 500 
iterations of the sampler Figure |7] is the equivalent for the 
500 iterations after the first 15,000. In the initial iterations, 
the space is explored by modifying and merging clusters, 
with the number of iterations changing frequently, in a gen- 
eral downward trend. On the other hand, once the MCMC 
has converged to the model space around a mode, the algo- 
rithm attempts to split clusters regularly, but the number of 
changes in the number of clusters are few, and increases in 
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Fig. 5: Posterior distribution of a for different number of 
initial clusters: boxplots for the distribution after the bum- 
in. 



Fig. 7: The trace of the posterior of the number of clusters 
after 15,000 iterations of the MCMC. 
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Initial sweeps 

Fig. 6: The trace of the posterior of the number of clusters 
in the first 500 iterations of the MCMC. 



the number of clusters are almost immediately reversed in 
the following iteration. 

The posterior distributions for the number of clusters is 
shown in Figure [8] for runs with different initial numbers of 
clusters. Here the size and shading of each circle represents 
the posterior frequency of each number of clusters. As can 
be seen from this figure, this provides further evidence that 
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Fig. 8: The posterior distribution of the number of clusters 
for 50,000 sweeps after a burn-in of 50,000 iterations. 



with 20 or more initial clusters the sampler will converge 
to a common area of posterior support, but with fewer than 
this the sampler will not visit this region of the model space, 
despite it having increased posterior support. 

Label switching moves This example also demonstrates the 
need for the label switching move discussed in Section [5.1.1 
to ensure good mixing. Comparing Figure l9]to Figure [T] we 
can see that the new label switching move suffers from no 
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Fig. 9: Acceptance rates with the new label switching move. 
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Fig. 10: Acceptance rates for the new label switching move. 



t:: ^ - 



Moves 1 -3 
Moves 1 -2 



i^ 



ai 



bI b 



I 



I 



I 



B 



B- 




30 40 50 60 70 

Initial number of clusters 



drop off in acceptance at any point throughout the run and it 



converges in a smaller number of iterations. Figure 10 shows 
the acceptance rate for our new label switching move, when 
the other two switching label is not implemented. While 
the performance is worse than using all three moves, it is 
the most effective single label switching move (see Section 



5.1.1 1. Thinning could be used to improve this, but the lack 



of convergence of a in Figure 1 1 when using only moves 
1 and 2 demonstrates clearly the importance and efficacy of 
the label switching move. 



Fig. 11: Posterior of a with and without label switching 
moves. 



7 Conclusions 

Our implementation of an efficient C++ MCMC sampler 
for sampling Dirichlet process mixture models synthesizes 
many of the various techniques, such as parameter blocking, 
slice sampling, and label switching, introduced by several 
other authors and discussed within this paper, into a single 
piece of open source software, available as an R package. 
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However, sampling from the complex model space that 
is a feature of the Dirichlet process mixture model remains a 
challenging task. In previous work by other authors, consid- 
erable progress has been made evolving the samplers through 
innovative strategies and approaches. Nonetheless, discus- 
sion of many of the residual difficulties is avoided through 
demonstrating the methods only on simulated data, or for 
datasets with strong signal. In practice however, with real 
datasets, the user does not have the option of simply avoid- 
ing these issues. 

In this paper we have attempted to highlight the dif- 
ficulties that a user may face in practice. We have added 
our own new features to our sampler, such as new block- 
ing strategies and additional label switching moves to build 
upon this previous research and further alleviate some of the 
challenges that are involved with such problems. We have 
also provided practical guidelines based on our experience, 
on how to make useful inference in the face of these limita- 
tions. Our work also demonstrates why it is important to not 
just ignore these issues, with specific focus on areas such 
as inference for stick breaking parameters that have not (to 
our knowledge) been previously discussed. Furthermore we 
have illustrated our discussions through the presentation of 
an example based upon an epidemiological dataset with a 
low signal to noise ratio. 

As a consequence of specifying the challenges explicitly, 
we hope that our work will motivate further developments 
in this area to take additional steps to improve sampler ef- 
ficiency. The challenge of designing MCMC moves that are 
able to escape local well-separated modes is considerable, 
but equally, so is the imagination and innovation of many 
practitioners developing new MCMC sampling methodolo- 
gies. 
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A Appendices 

A.l 

We provide the following proposition concerning the relationship be- 
tween the ordering and a. 

Proposition 1 Suppose that we have a model with posterior as given 
in Equation^ Then f{ipc > V'c+i|a) "' a function of a, and further- 
more¥{ipc > ipc+i) > 0.5. 



Proof Ifipc > i'c+i then Vc > V'c+i(l — Vc), which implies Vc+i < 
Vc/(1 - Vc). Thus 

P( ^c > ^c+i|q) = P(Vc+i < Vc/{1 - Vc)\a) 

/ a2(i-yi)'^-i(i- V2)"~My2dVi 

Jo 

+ f f a^(i- Vi)°~i(i-y2)"~My2dyi 

Jo. 5 Jo 



a{l - Vir-^ - a{l - Vi)"-'^ 



l-2yi 
1-Vi 



dVi 



+ [ a(l-yi)"-MVi 

Jo. 5 

Jo 



(i-yi)°-idVi 



0.5 



a- 



{1-2V-,Y 



1- 



1-Vi 

0'5 (i_2Vi)" 



dVi 



a- 



dVi 



Now since, (1 - 2yi)"/(l - Vi) < (1 - 2Vi)'= 



CJ 



^ ' dVi<a (l-2yi)"-MVi =0.5. 

Jo 



/o ^-Vi 

So ¥{tpc > ^c+i\ci) > 0.5 for all a. Finally, 



> V'c+i) = / F(^c > ^c+i\a)p{a)dc 
> / 0.5p{a)da = 0.5. 



A.2 



Proposition 2 Consider the label switching move defined in Equa- 
tions^to^in Section \5.1.1\ Then: 

(i) (V+)' := t/.^ + ^:,+ i =^c + ^c+1 = ^+; 

(ii) (1 - y,')(l - K'+i) = (1 - V.)(l - Vi+i); 

(Hi) The proposal mechanism is its own reverse; 
(iv) 



E(Vc|2', a) _ 1 + a + n^+i + Ei>c+i ™i 






and 



and 



E(Vc|Z,a) l + a + nc + Ei>c+i"i 

(v) the acceptance probability for this move is given by min{l, H}, 
where the acceptance ratio R is given in Equation^ 

Proof (i) By definition 

(V+)' :=Vc + V'c+i 

i^+ ( E[V.,|Z',a] , E[^,+ i|Z',a] 
Wc+l =T- rzz — 7 + V, 



^' V ]E[^c+i|2,a] ^" E[Vc|Z,a] 

'/'+ , 4- 

1" ^ ' 



(ii) From (i), 

Ip'c + 1pc+l = i>c+ tpc+l 

implies 

[v;' + v;'+i(i-v;')] 11(1-^/) 



[Vc + Vc+i{l-Vc)]Yl{l-Vi). 

Kc 
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By EquationVn V( = Vi for all I < c, 

=> v,' + vf+i(i-v;') = Vc + yc+i(i-K;) 
^ (1 - v^){i - K'+i) = (1 - Ve)(i - y,+i). 

The importance of this result is that it provides confirmation that 
our proposed if)' in Equation [6| can be achieved with the V de- 
fined in Equation^ In particular, with this choice ofV, the only 
weights that are changed are those associated with components c 
and c-\- 1, as desired. 
(Hi) Suppose that the Markov chain is currently in the proposed state 
defined in Equationsn\tow\i.e. {V , 0' , Z' , U, a, A). We show 
that applying the proposal mechanism to this state, for component 
c and c + 1, the proposed new state is the original state 

{V", &", Z",U, Q, A) = (V, 0, Z, U, a, A.) 

The parameters U, a and A are unchanged by design of the pro- 
posal mechanism. Also, by design, the allocations Z and clus- 
ter parameters are simply swapped for the selected compo- 
nents, so trivially Z" = Z and 0" = 0. Since V" is un- 
changed for I ^ {c, c + 1}, /; remains only to show VjJ = Vc 
and V^',^ = Vc+i, or equivalently ip'^ = ipc and i/'"_i_j^ = i/'c+i- 
To confirm, 

V>+ V.+ E[^,+ i|Z',a] K[iPc\Z",a] 
' <^" W E[ipc\Z,a] m>c+i\Z',a] 
{by (i) and Equation^ 



tpc 



(^^ 



the— — '— since Z" = Z. 



(11) 

(12) 



However, 



E[^c|Z" 



■V', 



,E[t/.,+ i|Z",a] 
K[iPe\Z',a] 



= — (tZ-c+^c+l) 

{from Equation^and since Z" = Z) 

Substituting this into Equation \l 2\ we get tp" = tpc. The result for 
V'c'+i '^'^" ^^ shown by simply following identical logic. 
(iv) From EquationU] we have 

E[4,c\Z,a] =E[VcYl{l-Vi)\Z,a] 

1<C 

= E[Vc\Z,a]YlE[{l~Vi)\Z,a] 
Kc 

^ + "^ ^ (13) 

(14) 






Similarly, 

E[^,+ i|Z,a] = 



1 + ne+i 

l + a + n^+i +Y.l^>r.+^ "-l 



■- + ^1 



,ni 



l-\-a + nc + J2i>c"-' 

Kc x- + " + "i + Er>l"i' 
By definition of Z' in Equation^ we have 



(15) 



n I 



ic+l 



ni 



Z = c+l 
otherwise. 



(16) 



This means from EquationsW3\andW5\we have 



E[ilJc\Z',a\ 
E[Vc+i|2,a] 



1 + n' 



l + a + n' 



ni 



(17) 



'c+l + Ei>c+1 
l + g + We+i +Ei>c+l"i 

l + n^+i 
1 + o + ric + "c+i + Ei>c+i "' 
a + nc+i +E!>c+i"i 
Substituting EquationU6\inton7\and simplifying gives the desired 
results. The result for S'^t]^ ^i" follows in the same fashion, 
(v) By (Hi) and the deterministic nature of the proposal mechanism, 
the only random feature of the proposal is the choice of component 
c. The probability of this choice is the same for the move and its 
reverse and so cancels. Therefore the only contribution to the ac- 
ceptance ratio is the ratio of posteriors. By design, the likelihood 
is unchanged, and by (ii) the only change in posterior is down to 
the change in weights of components c and c+l. Therefore we 
have. 



R = 






(18) 



by Equation\16\ 



(19) 



ipc J VV'c+l, 

Substituting in Equation [6| and the results in (iv), we obtain the 
desired acceptance ratio. 
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